Read in our dataset

We will work with a sample of the data if there are too many datapoints. The full dataset may exceed this notebook's memory, and the random selection aspect of sampling helps ensure a representative subset.
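A minimal sketch of that sampling step, assuming a pandas DataFrame; the synthetic `df` and the `MAX_ROWS` cap are placeholders for the real dataset and whatever limit fits this notebook's memory:

```python
import pandas as pd

# stand-in for the real load, e.g. pd.read_csv(...)
df = pd.DataFrame({"value": range(100_000), "group": ["a", "b"] * 50_000})

MAX_ROWS = 10_000  # assumed cap; tune to available memory
if len(df) > MAX_ROWS:
    # random_state pins the random draw so reruns are reproducible
    df = df.sample(n=MAX_ROWS, random_state=0)
```

Fixing `random_state` keeps the sampled subset stable across notebook restarts.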

Preview of the data

Next we want to identify all string columns and encode them. This uses OneHotEncoder, but more domain knowledge of the dataset might suggest better encodings; for instance, some of the categories may be best mapped to a single integer each, since they are all on a scale. This preproc will be used in our subsequent pipelines.
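One way this step could look, assuming a pandas DataFrame; the toy `df` here (with an ordinal-looking `size` column of the kind the text mentions) is purely illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# toy stand-in for the real data
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],   # arguably ordinal
    "color": ["red", "blue", "red", "green"],
    "price": [1.0, 3.5, 2.0, 1.2],
})

# string columns show up as object dtype in pandas
string_cols = df.select_dtypes(include="object").columns.tolist()

# one-hot the strings, pass numeric columns through untouched
preproc = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), string_cols)],
    remainder="passthrough",
)
X = preproc.fit_transform(df)
```

`handle_unknown="ignore"` keeps later pipeline stages from failing on category values that were not seen during fitting.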

Other types of data should be normalized - adjusted to a mean of 0 and scaled to the same basic range. This makes the different components much more comparable and amenable to further analysis.
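A sketch of that normalization using scikit-learn's `StandardScaler` (one common choice for "mean 0, comparable scale"); the small array stands in for the numeric columns:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two columns on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# each column now has mean 0 and unit variance
```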

Principal Component Analysis

Principal component analysis can help us view the inherent dimensionality of the data. Oftentimes the dimensionality of our data is much higher than it really needs to be to adequately explain it, and this analysis helps us determine how many dimensions are actually needed.

Note that with one-hot encoding each categorical property contributes one dimension per category value, so the encoded space - and even the reduced one - may have more dimensions than the raw data we started with.
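A sketch of reading off the inherent dimensionality from the cumulative explained variance; the synthetic data (5 observed features driven by 2 latent factors) and the 95% threshold are illustrative choices:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 5 observed features, but only 2 underlying factors plus tiny noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
# smallest number of components explaining >= 95% of the variance
n_needed = int(np.searchsorted(cum, 0.95) + 1)
```

Plotting `cum` against component index gives the familiar "elbow" view of the same information.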

MDS

Multidimensional scaling tries to map the data down to a lower number of dimensions while respecting the pairwise distances of the data as much as possible. We are scaling it down to 3 dimensions, so a location-plus-color graph can show the resulting shape.
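The embedding step might look like the following, with random data standing in for the preprocessed feature matrix:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))  # stand-in for the preprocessed data

# embed into 3 dimensions, preserving pairwise distances as well as possible
mds = MDS(n_components=3, random_state=0)
embedding = mds.fit_transform(X)
```

`mds.stress_` after fitting gives a rough measure of how much distance distortion the 3D embedding introduced.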

Clustering

We can then cluster the data to see if there are obvious groupings within it.
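Since the next step asks how many clusters were found, a density-based method that discovers the cluster count itself is one plausible choice; this sketch uses DBSCAN on synthetic blobs, and `eps`/`min_samples` are assumed tuning values:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# synthetic stand-in for the preprocessed feature matrix
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

db = DBSCAN(eps=0.8, min_samples=5).fit(X)
labels = db.labels_  # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Unlike KMeans, DBSCAN does not need the cluster count up front, which matches an exploratory "are there obvious groupings?" question.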

Let's see how many clusters we found

Visualized Clusters

We can combine MDS with clustering to show where the clusters lie in the MDS projection. Since we will need color to show cluster membership, this will be a 3D MDS. Note that the clustering is done in the original space and is only being displayed at reduced dimensionality, so the clusters may not separate visually even when they are well separated in the full space.
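A sketch of that combined plot: cluster in the original space, embed with 3D MDS, and color points by cluster label. The blob data, KMeans with `n_clusters=3`, and the output filename are all illustrative stand-ins:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend for scripted runs
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import MDS

X, _ = make_blobs(n_samples=150, centers=3, n_features=6, random_state=0)

# clustering happens in the original 6D space...
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# ...and only the display is reduced to 3D
coords = MDS(n_components=3, random_state=0).fit_transform(X)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], c=labels)
fig.savefig("clusters_mds.png")
```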

Outlier Detection

Another thing to do is try to identify outliers in the data.
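One common approach is an isolation forest; this sketch plants a few obvious outliers in synthetic data, and the `contamination` value is an assumed estimate of the outlier fraction:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 normal points plus 5 planted far-away outliers
X = np.vstack([rng.normal(0, 1, size=(200, 4)),
               rng.normal(8, 1, size=(5, 4))])

iso = IsolationForest(contamination=0.03, random_state=0).fit(X)
pred = iso.predict(X)  # -1 = outlier, 1 = inlier
n_outliers = int((pred == -1).sum())
```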

First we can see how many outliers we detected

Then we can graph them on the same dimensionality reduction plot as before
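That overlay could be sketched like this: mark detected outliers on the 3D MDS embedding. The synthetic data and filename are placeholders:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 5)),
               rng.normal(7, 1, size=(4, 5))])  # 4 planted outliers

is_outlier = IsolationForest(contamination=0.05, random_state=0).fit_predict(X) == -1
coords = MDS(n_components=3, random_state=0).fit_transform(X)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(*coords[~is_outlier].T, label="inliers")
ax.scatter(*coords[is_outlier].T, marker="x", label="outliers")
ax.legend()
fig.savefig("outliers_mds.png")
```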

We can also just get a list of those outliers and work with it directly. For instance, we can inspect those values. We could also try filtering out the outliers and re-running the other analyses to see if underlying patterns become more apparent.
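A sketch of that filter-and-rerun idea, using an isolation forest mask and then re-clustering the inliers; the synthetic data, the `contamination` value, and the choice of KMeans for the rerun are all assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# two real groups plus a few extreme points that distort analyses
X = np.vstack([rng.normal(0, 1, size=(100, 3)),
               rng.normal(6, 1, size=(100, 3)),
               rng.normal(30, 1, size=(4, 3))])

# boolean mask of inliers (fit_predict returns 1 for inliers, -1 for outliers)
mask = IsolationForest(contamination=0.05, random_state=0).fit_predict(X) == 1
X_clean = X[mask]

# re-run the clustering on the cleaned data
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_clean)
```

The same `mask` can index the original DataFrame, so any of the earlier analyses can be repeated on `X_clean` unchanged.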

Clusters without outliers